{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "# Entity Service Similarity Scores Output\n", "\n", "This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and retrieving the results.\n", "The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.\n", "\n", "The sections are usually run by different participants, but for illustration everything is carried out in this one file. The participants providing data are *Alice* and *Bob*, and the analyst is acting as the integration authority.\n", "\n", "### Who learns what?\n", "\n", "Alice and Bob will both generate and upload their CLKs.\n", "\n", "The analyst - who creates the linkage project - learns the `similarity scores`. Be aware that these scores reveal a lot of information and are subject to frequency attacks.\n", "\n", "### Steps\n", "\n", "* Check connection to Entity Service\n", "* Data preparation\n", "  * Write CSV files with PII\n", "  * Create a Linkage Schema\n", "* Create Linkage Project\n", "* Generate CLKs from PII\n", "* Upload the CLKs\n", "* Create a run\n", "* Retrieve and analyse results" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import json\n", "import os\n", "import time\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "import requests\n", "import anonlinkclient.rest_client\n", "from IPython.display import clear_output" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Check Connection\n", "\n", "If you are connecting to a custom entity service, change the address here." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing anonlink-entity-service hosted at https://anonlink.easd.data61.xyz\n" ] } ], "source": [ "url = os.getenv(\"SERVER\", \"https://anonlink.easd.data61.xyz\")\n", "print(f'Testing anonlink-entity-service hosted at {url}')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"project_count\": 845, \"rate\": 593838, \"status\": \"ok\"}\r\n" ] } ], "source": [ "!anonlink status --server \"{url}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Data preparation\n", "\n", "Following the [anonlink client command line tutorial](https://anonlink-client.readthedocs.io/en/latest/tutorial/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.\n", "\n", "If you are following along yourself you may have to adjust the file names in all the `!anonlink` commands." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from tempfile import NamedTemporaryFile\n", "from recordlinkage.datasets import load_febrl4" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
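<table>
<thead>
<tr><th></th><th>given_name</th><th>surname</th><th>street_number</th><th>address_1</th><th>address_2</th><th>suburb</th><th>postcode</th><th>state</th><th>date_of_birth</th><th>soc_sec_id</th></tr>
<tr><th>rec_id</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>rec-1070-org</th><td>michaela</td><td>neumann</td><td>8</td><td>stanley street</td><td>miami</td><td>winston hills</td><td>4223</td><td>nsw</td><td>19151111</td><td>5304218</td></tr>
<tr><th>rec-1016-org</th><td>courtney</td><td>painter</td><td>12</td><td>pinkerton circuit</td><td>bega flats</td><td>richlands</td><td>4560</td><td>vic</td><td>19161214</td><td>4066625</td></tr>
<tr><th>rec-4405-org</th><td>charles</td><td>green</td><td>38</td><td>salkauskas crescent</td><td>kela</td><td>dapto</td><td>4566</td><td>nsw</td><td>19480930</td><td>4365168</td></tr>
</tbody>
</table>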
\n", "
" ], "text/plain": [ " given_name surname street_number address_1 \\\n", "rec_id \n", "rec-1070-org michaela neumann 8 stanley street \n", "rec-1016-org courtney painter 12 pinkerton circuit \n", "rec-4405-org charles green 38 salkauskas crescent \n", "\n", " address_2 suburb postcode state date_of_birth \\\n", "rec_id \n", "rec-1070-org miami winston hills 4223 nsw 19151111 \n", "rec-1016-org bega flats richlands 4560 vic 19161214 \n", "rec-4405-org kela dapto 4566 nsw 19480930 \n", "\n", " soc_sec_id \n", "rec_id \n", "rec-1070-org 5304218 \n", "rec-1016-org 4066625 \n", "rec-4405-org 4365168 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfA, dfB = load_febrl4()\n", "\n", "a_csv = NamedTemporaryFile('w')\n", "a_clks = NamedTemporaryFile('w', suffix='.json')\n", "dfA.to_csv(a_csv)\n", "a_csv.seek(0)\n", "\n", "b_csv = NamedTemporaryFile('w')\n", "b_clks = NamedTemporaryFile('w', suffix='.json')\n", "dfB.to_csv(b_csv)\n", "b_csv.seek(0)\n", "\n", "dfA.head(3)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Schema Preparation\n", "\n", "The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns `rec_id` and `soc_sec_id` for CLK generation." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "schema = NamedTemporaryFile('wt')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting /tmp/tmprrzuvk7f\n" ] } ], "source": [ "%%writefile {schema.name}\n", "{\n", " \"version\": 3,\n", " \"clkConfig\": {\n", " \"l\": 1024,\n", " \"xor_folds\": 0,\n", " \"kdf\": {\n", " \"type\": \"HKDF\",\n", " \"hash\": \"SHA256\",\n", " \"info\": \"c2NoZW1hX2V4YW1wbGU=\",\n", " \"salt\": \"SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==\",\n", " \"keySize\": 64\n", " }\n", " },\n", " \"features\": [\n", " {\n", " \"identifier\": \"rec_id\",\n", " \"ignored\": true\n", " },\n", " {\n", " \"identifier\": \"given_name\",\n", " \"format\": {\n", " \"type\": \"string\",\n", " \"encoding\": \"utf-8\"\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 200\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 2,\n", " \"positional\": false\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"surname\",\n", " \"format\": {\n", " \"type\": \"string\",\n", " \"encoding\": \"utf-8\"\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 200\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 2,\n", " \"positional\": false\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"street_number\",\n", " \"format\": {\n", " \"type\": \"integer\"\n", " },\n", " \"hashing\": {\n", " \"missingValue\": {\n", " \"sentinel\": \"\"\n", " },\n", " \"strategy\": {\n", " \"bitsPerFeature\": 100\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " 
\"type\": \"ngram\",\n", " \"n\": 1,\n", " \"positional\": true\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"address_1\",\n", " \"format\": {\n", " \"type\": \"string\",\n", " \"encoding\": \"utf-8\"\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 100\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 2,\n", " \"positional\": false\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"address_2\",\n", " \"format\": {\n", " \"type\": \"string\",\n", " \"encoding\": \"utf-8\"\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 100\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 2,\n", " \"positional\": false\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"suburb\",\n", " \"format\": {\n", " \"type\": \"string\",\n", " \"encoding\": \"utf-8\"\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 100\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 2,\n", " \"positional\": false\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"postcode\",\n", " \"format\": {\n", " \"type\": \"integer\",\n", " \"minimum\": 100,\n", " \"maximum\": 9999\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 100\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 1,\n", " \"positional\": true\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"state\",\n", " \"format\": {\n", " \"type\": \"string\",\n", " \"encoding\": \"utf-8\",\n", " \"maxLength\": 3\n", " },\n", " \"hashing\": {\n", " \"strategy\": {\n", " \"bitsPerFeature\": 100\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " 
\"type\": \"ngram\",\n", " \"n\": 2,\n", " \"positional\": false\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"date_of_birth\",\n", " \"format\": {\n", " \"type\": \"integer\"\n", " },\n", " \"hashing\": {\n", " \"missingValue\": {\n", " \"sentinel\": \"\"\n", " },\n", " \"strategy\": {\n", " \"bitsPerFeature\": 200\n", " },\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"comparison\": {\n", " \"type\": \"ngram\",\n", " \"n\": 1,\n", " \"positional\": true\n", " }\n", " }\n", " },\n", " {\n", " \"identifier\": \"soc_sec_id\",\n", " \"ignored\": true\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Create Linkage Project\n", "\n", "The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Credentials will be saved in /tmp/tmp1h8qppks\n", "\u001b[31mProject created\u001b[0m\n" ] }, { "data": { "text/plain": [ "{'project_id': '8f8347b3e97665ebc87f4a1744a2a62e0ae4c999184bc754',\n", " 'result_token': 'f6ef6b121f3e5861bceddc36cf1cfdebf9c25a2352937c90',\n", " 'update_tokens': ['192e38e1f9e773b945c882799e5490502b9454c711b66e2d',\n", " 'f1ca0cbbdc3055c731e898a4ebe6121be9e8d82541fb78fd']}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "creds = NamedTemporaryFile('wt')\n", "print(\"Credentials will be saved in\", creds.name)\n", "\n", "!anonlink create-project \\\n", " --schema \"{schema.name}\" \\\n", " --output \"{creds.name}\" \\\n", " --type \"similarity_scores\" \\\n", " --server \"{url}\"\n", "\n", "creds.seek(0)\n", "\n", "with open(creds.name, 'r') as f:\n", " credentials = json.load(f)\n", "\n", "project_id = credentials['project_id']\n", "credentials" ] }, { "cell_type": "markdown", "metadata": { 
"pycharm": {} }, "source": [ "**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.\n", "\n", "## Hash and Upload\n", "\n", "At the moment both data providers have *raw* personally identiy information. We first have to generate CLKs from the raw entity information. Please see [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mCLK data written to /tmp/tmp63vp_3mj.json\u001b[0m\n", "\u001b[31mCLK data written to /tmp/tmpr4cqqglj.json\u001b[0m\n" ] } ], "source": [ "!anonlink hash \"{a_csv.name}\" secret \"{schema.name}\" \"{a_clks.name}\"\n", "!anonlink hash \"{b_csv.name}\" secret \"{schema.name}\" \"{b_clks.name}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Now the two clients can upload their data providing the appropriate *upload tokens*." ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Alice uploads her data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile('wt') as f:\n", " !anonlink upload \\\n", " --project=\"{project_id}\" \\\n", " --apikey=\"{credentials['update_tokens'][0]}\" \\\n", " --server \"{url}\" \\\n", " --output \"{f.name}\" \\\n", " \"{a_clks.name}\"\n", " res = json.load(open(f.name))\n", " alice_receipt_token = res['receipt_token']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Every upload gets a receipt token. In some operating modes this receipt is required to access the results." 
] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Bob uploads his data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile('wt') as f:\n", " !anonlink upload \\\n", " --project=\"{project_id}\" \\\n", " --apikey=\"{credentials['update_tokens'][1]}\" \\\n", " --server \"{url}\" \\\n", " --output \"{f.name}\" \\\n", " \"{b_clks.name}\"\n", " \n", " bob_receipt_token = json.load(open(f.name))['receipt_token']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Create a run\n", "\n", "Now the project has been created and the CLK data has been uploaded we can carry out some privacy preserving record linkage. Try with a few different threshold values:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile('wt') as f:\n", " !anonlink create \\\n", " --project=\"{project_id}\" \\\n", " --apikey=\"{credentials['result_token']}\" \\\n", " --server \"{url}\" \\\n", " --threshold 0.75 \\\n", " --output \"{f.name}\"\n", " \n", " run_id = json.load(open(f.name))['run_id']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Results\n", "\n", "Now after some delay (depending on the size) we can fetch the result.\n", "This can be done with `anonlink`:\n", "\n", " !anonlink results --server \"{url}\" \\\n", " --project=\"{credentials['project_id']}\" \\\n", " --apikey=\"{credentials['result_token']}\" --output results.txt\n", " \n", "However for this tutorial we are going to use the `anonlinkclient.rest_client` module:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State: completed\n", "Stage (2/2): compute similarity scores\n", "Progress: 100.00%\n" ] } ], "source": 
[ "from anonlinkclient.rest_client import RestClient\n", "from anonlinkclient.rest_client import format_run_status\n", "rest_client = RestClient(url)\n", "for update in rest_client.watch_run_status(project_id, run_id, credentials['result_token'], timeout=300):\n", " clear_output(wait=True)\n", " print(format_run_status(update))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "data = json.loads(rest_client.run_get_result_text(\n", " project_id, \n", " run_id, \n", " credentials['result_token']))['similarity_scores']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "This result is a large list of tuples recording the similarity between all rows above the given threshold." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0, 76], [1, 2345], 1.0]\n", "[[0, 83], [1, 3439], 1.0]\n", "[[0, 103], [1, 863], 1.0]\n", "[[0, 154], [1, 2391], 1.0]\n", "[[0, 177], [1, 4247], 1.0]\n", "[[0, 192], [1, 1176], 1.0]\n", "[[0, 270], [1, 4516], 1.0]\n", "[[0, 312], [1, 1253], 1.0]\n", "[[0, 407], [1, 3743], 1.0]\n", "[[0, 670], [1, 3550], 1.0]\n" ] } ], "source": [ "for row in data[:10]:\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Note there can be a lot of similarity scores:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": [ "280116" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "We will display a *sample* of these similarity scores in a histogram using matplotlib:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "image/png": "\n", 
"text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.style.use('seaborn-deep')\n", "plt.hist([score for _, _, score in data], bins=50)\n", "plt.xlabel('similarity score')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "The vast majority of these similarity scores are for non matches. We expect the matches to have a high similarity score. So let's zoom into the right side of the distribution." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEGCAYAAACevtWaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAT/ElEQVR4nO3df7SlVX3f8fcnINBllAGZzqIzjEN0GjNtlsiahSTGhEpUwOCQVKgm1ZFOOjGLuNpKk2LtMrbLP+IyRjFpTWaBcTD+AG0so6VaMog2tmAG+S1RRoQwI8jIr4TiL/TbP86echzunXvuveeee2ff92utu+7z7Gc/5+yzuXzOnn2eZ59UFZKkvvzYYjdAkjR+hrskdchwl6QOGe6S1CHDXZI6dPhiNwDguOOOq3Xr1i12MyTpkHLDDTd8q6pWTnVsSYT7unXr2LVr12I3Q5IOKUnume6Y0zKS1CHDXZI6ZLhLUocMd0nqkOEuSR0aKdyTrEjy8SR/neSOJD+T5NgkVye5s/0+ptVNkvcm2Z3kliQnL+xLkCQdaNSR+8XAp6vqecDzgTuAi4CdVbUe2Nn2Ac4E1refrcD7xtpiSdKMZgz3JEcDPw9cClBV36uqR4BNwPZWbTtwTtveBFxWA9cBK5IcP/aWS5KmNcrI/URgH/CnSW5MckmSpwOrquq+Vud+YFXbXg3cO3T+nlb2I5JsTbIrya59+/bN/RVIkp5ilDtUDwdOBt5YVdcnuZgnp2AAqKpKMqtv/aiqbcA2gI0bN875G0POvvDKKcs/+a5Nc31ISTrkjTJy3wPsqarr2/7HGYT9N/dPt7TfD7Tje4EThs5f08okSRMyY7hX1f3AvUl+shWdDnwZ2AFsbmWbgf1D6B3A69pVM6cCjw5N30iSJmDUhcPeCHwoyRHAXcD5DN4YrkiyBbgHOK/VvQo4C9gNPN7qSpImaKRwr6qbgI1THDp9iroFXDDPdkmS5sE7VCWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQyOFe5K7k9ya5KYku1rZsUmuTnJn+31MK0+S9ybZneSWJCcv5AuQJD3VbEbu/6SqTqqqjW3/ImBnVa0HdrZ9gDOB9e1nK/C+cTVWkjSa+UzLbAK2t+3twDlD5ZfVwHXAiiTHz+N5JEmzNGq4F/A/k9yQZGsrW1VV97Xt+4FVbXs1cO/QuXtamSRpQg4fsd7PVdXeJH8fuDrJXw8f
rKpKUrN54vYmsRVg7dq1szlVkjSDkUbuVbW3/X4A+ARwCvDN/dMt7fcDrfpe4ISh09e0sgMfc1tVbayqjStXrpz7K5AkPcWM4Z7k6UmesX8beBlwG7AD2NyqbQaubNs7gNe1q2ZOBR4dmr6RJE3AKNMyq4BPJNlf/8NV9ekkfwVckWQLcA9wXqt/FXAWsBt4HDh/7K2WJB3UjOFeVXcBz5+i/EHg9CnKC7hgLK2TJM2Jd6hKUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nq0MjhnuSwJDcm+VTbPzHJ9Ul2J7k8yRGt/Mi2v7sdX7cwTZckTWc2I/d/BdwxtP8O4N1V9VzgYWBLK98CPNzK393qSZImaKRwT7IGeAVwSdsP8BLg463KduCctr2p7dOOn97qS5ImZNSR+3uA3wF+2PafBTxSVU+0/T3A6ra9GrgXoB1/tNX/EUm2JtmVZNe+ffvm2HxJ0lRmDPckvwQ8UFU3jPOJq2pbVW2sqo0rV64c50NL0rJ3+Ah1XgS8MslZwFHAM4GLgRVJDm+j8zXA3lZ/L3ACsCfJ4cDRwINjb7kkaVozjtyr6s1Vtaaq1gGvBq6pql8DPgu8qlXbDFzZtne0fdrxa6qqxtpqSdJBzec6938HvCnJbgZz6pe28kuBZ7XyNwEXza+JkqTZGmVa5v+rqmuBa9v2XcApU9T5DnDuGNomSZoj71CVpA4Z7pLUIcNdkjo0qzn3Q8nZF145Zfkn37Vpwi2RpMlz5C5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ90uPyBJkzbdsicw+aVPHLlLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdmjHckxyV5ItJbk5ye5L/2MpPTHJ9kt1JLk9yRCs/su3vbsfXLexLkCQdaJSR+3eBl1TV84GTgDOSnAq8A3h3VT0XeBjY0upvAR5u5e9u9SRJEzRjuNfAY233ae2ngJcAH2/l24Fz2vamtk87fnqSjK3FkqQZjTTnnuSwJDcBDwBXA18DHqmqJ1qVPcDqtr0auBegHX8UeNYUj7k1ya4ku/bt2ze/VyFJ+hEjhXtV/aCqTgLWAKcAz5vvE1fVtqraWFUbV65cOd+HkyQNmdXVMlX1CPBZ4GeAFUn2f9nHGmBv294LnADQjh8NPDiW1kqSRjLK1TIrk6xo238PeClwB4OQf1WrthnY/xUkO9o+7fg1VVXjbLQk6eBG+Zq944HtSQ5j8GZwRVV9KsmXgY8meTtwI3Bpq38p8MEku4GHgFcvQLslSQcxY7hX1S3AC6Yov4vB/PuB5d8Bzh1L6yRJc+IdqpLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDo1znLkmHlLMvvHLK8k++a9OEW7J4HLlLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDnkppCTN0nSXWi4ljtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ17lLWjaW01LAhrukZa/H0HdaRpI6ZLhLUodmnJZJcgJwGbAKKGBbVV2c5FjgcmAdcDdwXlU9nCTAxcBZwOPA66vqSwvTfEnL2aGwxstiGWXk/gRwYVVtAE4FLkiyAbgI2FlV64GdbR/gTGB9+9kKvG/srZYkHdSM4V5V9+0feVfV3wF3AKuBTcD2Vm07cE7b3gRcVgPXASuSHD/2lkuSpjWrOfck64AXANcDq6rqvnbofgbTNjAI/nuHTtvTyiRJEzLypZBJfhz4r8C/rqq/HUytD1RVJanZPHGSrQymbVi7du1sTpXUKefQx2ekcE/yNAbB/qGq+vNW/M0kx1fV
fW3a5YFWvhc4Yej0Na3sR1TVNmAbwMaNG2f1xiBJk3Aov9nMOC3Trn65FLijqv5g6NAOYHPb3gxcOVT+ugycCjw6NH0jSZqAUUbuLwJeC9ya5KZW9u+B3wOuSLIFuAc4rx27isFlkLsZXAp5/lhbLEma0YzhXlV/CWSaw6dPUb+AC+bZLknSPLi2jKSJO5Tnsg8VLj8gSR1y5C5pwThCXzyO3CWpQ4a7JHXIcJekDhnuktShZfeB6sE+4DmUv1JLkoY5cpekDhnuktShZTctI2n8vJ596XHkLkkdcuQuSRMw3b9uFupCDsNdWqYmHTaaLKdlJKlDhrskdchwl6QOGe6S1CE/UJU0Eq9lP7Q4cpekDhnuktQhw12SOuScu7REjWuOe7Y3JTm33gfDXVpkhqkWguEudc43j+XJOXdJ6tCMI/ck7wd+CXigqv5xKzsWuBxYB9wNnFdVDycJcDFwFvA48Pqq+tLCNF1amlyQS0vBKCP3DwBnHFB2EbCzqtYDO9s+wJnA+vazFXjfeJopSZqNGcO9qj4PPHRA8SZge9veDpwzVH5ZDVwHrEhy/LgaK0kazVw/UF1VVfe17fuBVW17NXDvUL09rew+DpBkK4PRPWvXrp1jM8bLf05L6sW8P1CtqgJqDudtq6qNVbVx5cqV822GJGnIXMP9m/unW9rvB1r5XuCEoXprWpkkaYLmOi2zA9gM/F77feVQ+W8l+SjwQuDRoekbaVnzenNN0iiXQn4EOA04Lske4HcZhPoVSbYA9wDntepXMbgMcjeDSyHPX4A2S0uCYa2lbMZwr6rXTHPo9CnqFnDBfBslSZoflx+Q8Eop9cflBySpQ47cpYNwXl2HKkfuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUNeCqlZ8WYf6dBguI/AQDv0+N9My53TMpLUIUfu8+DocPHN9g5S7zjVcmG4d2ScbzaGoHRoM9y1oMb1huObjTQ7hvsysBSD0SktaWEZ7gtgoYPrUAprSYvDq2UkqUOO3JcApyie5L8ApPEw3Jcwg07SXDktI0kdMtwlqUOGuyR1yDn3CXIOXdKkOHKXpA4Z7pLUoQUJ9yRnJPlKkt1JLlqI55AkTW/s4Z7kMOA/A2cCG4DXJNkw7ueRJE1vIUbupwC7q+quqvoe8FFg+d1qKUmLaCGullkN3Du0vwd44YGVkmwFtrbdx5J8ZQHaMpPjgG8twvMeKuyfmdlHB2f/zCB/MK8+evZ0BxbtUsiq2gZsW6znB0iyq6o2LmYbljL7Z2b20cHZPzNbqD5aiGmZvcAJQ/trWpkkaUIWItz/Clif5MQkRwCvBnYswPNIkqYx9mmZqnoiyW8BnwEOA95fVbeP+3nGZFGnhQ4B9s/M7KODs39mtiB9lKpaiMeVJC0i71CVpA4Z7pLUoS7DfablD5KsTfLZJDcmuSXJWUPH3tzO+0qSl0+25ZMz1z5Ksi7Jt5Pc1H7+ePKtX3gj9M+zk+xsfXNtkjVDxzYnubP9bJ5syydnnn30g6G/oS4vuEjy/iQPJLltmuNJ8t7Wf7ckOXno2Pz/hqqqqx8GH+J+DfgJ4AjgZmDDAXW2Ab/ZtjcAdw9t3wwcCZzYHuewxX5NS6yP1gG3LfZrWAL98zFgc9t+CfDBtn0scFf7fUzbPmaxX9NS6qO2/9hiv4YJ9NHPAydP9/8LcBbwP4AApwLXj/NvqMeR+yjLHxTwzLZ9NPCNtr0J+GhVfbeqvg7sbo/Xm/n00XIwSv9sAK5p258dOv5y4OqqeqiqHgauBs6YQJsnbT59tCxU1eeBhw5SZRNwWQ1cB6xIcjxj+hvqMdynWv5g9QF13gb88yR7gKuAN87i3B7Mp48ATmzTNZ9L8uIFbeniGKV/bgZ+pW3/MvCMJM8a8dwezKePAI5KsivJdUnOWdimLlnT9eFY/oZ6DPdRvAb4QFWtYfBPow8mWa59MZ3p+ug+YG1VvQB4E/DhJM88yOP06t8Cv5DkRuAX
GNyF/YPFbdKSc7A+enYNbrn/VeA9SZ6zSG3sVo+BNsryB1uAKwCq6v8ARzFY4Gi5LJ0w5z5qU1YPtvIbGMy7/sMFb/Fkzdg/VfWNqvqV9ib3llb2yCjndmI+fURV7W2/7wKuBV4wgTYvNdP14Vj+hnoM91GWP/gb4HSAJD/FILj2tXqvTnJkkhOB9cAXJ9byyZlzHyVZ2dbsJ8lPMOijuybW8smYsX+SHDf0r703A+9v258BXpbkmCTHAC9rZb2Zcx+1vjlyfx3gRcCXJ9bypWMH8Lp21cypwKNVdR/j+hta7E+UF+hT6rOArzIYVb6llf0n4JVtewPwBQZzgjcBLxs69y3tvK8AZy72a1lqfQT8U+D2VvYl4OzFfi2L1D+vAu5sdS4Bjhw6918w+DB+N3D+Yr+WpdZHwM8Ct7a/rVuBLYv9Whaofz7CYBrz+wzmzbcAbwDe0I6HwRcbfa31w8Zx/g25/IAkdajHaRlJWvYMd0nqkOEuSR0y3CWpQ4a7JHXIcNeSluSSJBtmUX9jkve27dcn+aNZPt/w+acl+dnZtVhaGsb+NXvSOFXVr8+y/i5g11yeK8nhB5x/GvAY8L/n8njjkOSwqnJZA82aI3ctCUmenuS/J7k5yW1J/lkrvzbJxrb9WJJ3Jrk9yV8kOaUdvyvJK1ud05J8aorHPzvJ9W3Bs79IsqqVvy3JB5N8gcH6Oacl+VSSdQxuOPk3bc3xFyf5epKntfOeObw/9DzntvbfnOTzreywJL/fym9J8sZWfnprz60ZrP29/67Nu5O8I8mXgHOTPCfJp5PckOR/JXneQvw3UF8cuWupOAP4RlW9AiDJ0VPUeTpwTVX9dpJPAG8HXsrgbtrtPHUJhWF/CZxaVZXk14HfAS5sxzYAP1dV305yGkBV3Z3BF5E8VlW/39p0LfAK4L8xuN3+z6vq+wc8z1uBl1fV3iQrWtlWBuvgn1SDL5A/NslRwAeA06vqq0kuA34TeE8758GqOrk9704GdzXemeSFwH9hsD66NC1H7loqbgVe2kasL66qR6eo8z3g00P1P9fC9VYG4Xkwa4DPJLkV+G3gHw0d21FV3x6hjZcA57ft84E/naLOF4APJPmXDL7QAuAXgT+pqicAquoh4CeBr1fVV1ud7Qy+3GG/ywGS/DiD2/U/luQm4E+A40doq5Y5w11LQgu5kxkE9duTvHWKat+vJ9fL+CHw3XbuD5n5X6F/CPxRVf008BsMFkLb7/+O2MYvAOva6P6wqnrK16dV1RuA/8BgVb8b8uT65bO1v00/BjxSVScN/fzUHB9Ty4jhriUhyT8AHq+qPwPeySDox+lonlw2ddTvpPw74BkHlF0GfJipR+0keU5VXV9Vb2Ww0ugJDL5J5zeSHN7qHMtgYbp1SZ7bTn0t8LkDH6+q/hb4epJz27lJ8vwR269lzHDXUvHTwBfb1MPvMphPH6e3MZjauAH41ojnfBL45f0fqLayDzH4XsuPTHPOO9sHpLcxuMrmZgbTOX8D3JLkZuBXq+o7DKZ2Ptamin4ITPdl478GbGnn3s4y+7o6zY2rQkqzkORVwKaqeu1it0U6GK+WkUaU5A+BMxmsYy4taY7cJalDzrlLUocMd0nqkOEuSR0y3CWpQ4a7JHXo/wGLDuOcFHH2SgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.hist([score for _, _, score in data if score >= 0.79], bins=50);\n", "plt.xlabel('similarity score')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, there is a cluster of scores between 0.9 and 1.0. To better visualize that these are indeed the scores for the matches, we will now extract the true_matches from the datasets and group the similarity scores into those for the matches and the non-matches (We can do this because we know the ground truth of the dataset)." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# rec_id in dfA has the form 'rec-1070-org'. We only want the number. Additionally, as we are\n", "# interested in the position of the records, we create a new index which contains the row numbers.\n", "dfA_ = dfA.rename(lambda x: x[4:-4], axis='index').reset_index()\n", "dfB_ = dfB.rename(lambda x: x[4:-6], axis='index').reset_index()\n", "# now we can merge dfA_ and dfB_ on the record_id.\n", "a = pd.DataFrame({'ida': dfA_.index, 'rec_id': dfA_['rec_id']})\n", "b = pd.DataFrame({'idb': dfB_.index, 'rec_id': dfB_['rec_id']})\n", "dfj = a.merge(b, on='rec_id', how='inner').drop(columns=['rec_id'])\n", "# and build a set of the corresponding row numbers.\n", "true_matches = set((row[0], row[1]) for row in dfj.itertuples(index=False))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "scores_matches = []\n", "scores_non_matches = []\n", "for (_, a), (_, b), score in data:\n", " if score < 0.79:\n", " continue\n", " if (a, b) in true_matches:\n", " scores_matches.append(score)\n", " else:\n", " scores_non_matches.append(score)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.hist([scores_matches, scores_non_matches], bins=50, label=['matches', 'non-matches'])\n", "plt.legend(loc='upper right')\n", "plt.xlabel('similarity score')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "We can see that the similarity scores for the matches and the ones for the non-matches form two different distributions. With a suitable linkage schema, these two distributions hardly overlap. \n", "\n", "When choosing a similarity threshold for solving, the valley between these two distributions is a good starting point. In this example, it is around 0.82. We can see that almost all similarity scores above 0.82 are from matches, thus the solver will produce a linkage result with high precision. However, recall will not be optimal, as there are still some scores from matches below 0.82. By moving the threshold to either side, you can favour either precision or recall." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mProject deleted\u001b[0m\r\n" ] } ], "source": [ "# Deleting the project\n", "!anonlink delete-project --project=\"{credentials['project_id']}\" \\\n", " --apikey=\"{credentials['result_token']}\" \\\n", " --server=\"{url}\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }